[ALF] Add index adjustment for UTF-8 indices #8056
/gemini review
Code Review
This pull request implements UTF-8 to UTF-16 index conversion for citations and grounding metadata, ensuring accurate text segment mapping in the public API. Key changes include the introduction of a convertUtf8IndexToUtf16 utility, updates to the Candidate model hierarchy, and the addition of comprehensive unit and instrumentation tests. Feedback highlights a potential out-of-bounds error when handling malformed surrogate pairs, performance inefficiencies due to redundant content scans, and compatibility issues with android.util.Log in non-Android test environments.
The AI Logic endpoints return citation indices as UTF-8 byte offsets, but Java and Kotlin strings are indexed by UTF-16 code units. As a result, whenever the text contains non-ASCII characters, the provided indices are offset from the actual content and can even point out of bounds, which makes them very difficult to use.
This applies to both citation metadata and grounding metadata.
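To illustrate the kind of adjustment described above, here is a minimal Java sketch of a UTF-8 byte offset to UTF-16 index conversion. It is not the code from this PR: the class name `Utf8IndexSketch` and the method `utf8ToUtf16Index` are hypothetical, and rounding a mid-character offset up to the next character boundary and clamping to the string length are assumptions made for the example.

```java
public final class Utf8IndexSketch {
    private Utf8IndexSketch() {}

    /**
     * Hypothetical helper: maps a UTF-8 byte offset to the matching UTF-16 index in text.
     * Offsets that fall inside a multi-byte character round up to the next character
     * boundary, and offsets past the end clamp to text.length() (both are assumptions).
     */
    public static int utf8ToUtf16Index(String text, int utf8Index) {
        if (utf8Index <= 0) {
            return 0;
        }
        int bytesSeen = 0;
        int i = 0;
        while (i < text.length() && bytesSeen < utf8Index) {
            // codePointAt returns a lone surrogate as-is, so this never reads past the end.
            int codePoint = text.codePointAt(i);
            bytesSeen += utf8Length(codePoint);
            i += Character.charCount(codePoint);
        }
        return i;
    }

    /** Number of bytes a code point occupies in UTF-8. */
    private static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }
}
```

For example, in `"a\u00B0b"` the degree symbol takes two bytes in UTF-8 but one UTF-16 code unit, so byte offset 3 (the start of `b`) maps to UTF-16 index 2. Walking code points rather than `char` values keeps surrogate pairs together, and an unpaired surrogate is consumed as a single unit instead of causing an out-of-bounds read.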
Tests were added for all validation to ensure that grounding indices match exactly what is provided; all existing tests currently pass. Further tests force grounding with citations using strings whose lengths differ between UTF-8 and UTF-16 (the degree symbol and accented letters are multi-byte characters in UTF-8).
Manual testing was done to check that the indices are accurate in practice. Exhaustive testing is difficult because citations are generally rarer, whereas grounding is easy to force and test.
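The length mismatch that these tests rely on can be shown with a short Java sketch. The class `LengthMismatchSketch`, its methods, and the sample string are hypothetical and are not taken from this PR's test suite; the sketch only shows why byte offsets cannot be used directly as string indices.

```java
import java.nio.charset.StandardCharsets;

public final class LengthMismatchSketch {
    private LengthMismatchSketch() {}

    /** Length of the string when encoded as UTF-8 bytes. */
    public static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Reference slice taken directly on the UTF-8 bytes. A test could compare this
     * against the substring produced with converted UTF-16 indices.
     * Assumes startByte and endByte fall on character boundaries.
     */
    public static String sliceByUtf8Bytes(String text, int startByte, int endByte) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, startByte, endByte - startByte, StandardCharsets.UTF_8);
    }
}
```

With the sample string `"caf\u00E9 at 20\u00B0C"`, `length()` is 12 while `utf8Length` is 14, because the accented letter and the degree symbol each take two bytes. The final segment occupies bytes 9 to 14, so calling `substring(9, 14)` with the raw byte offsets throws `StringIndexOutOfBoundsException`, whereas `sliceByUtf8Bytes(text, 9, 14)` recovers the intended segment.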